San Fransico Zillow Weekly Sales Forecasting Final

Section 1

EDA and TS Decomposition

Data Load and Split

Code
sf_week_sales<- read_excel("SF_Zillow_Weekly_Sales.xlsx")

 sales_tbl_ts <- sf_week_sales %>% dplyr::select(date, Median_Sales_Price) %>%
  mutate(y = Median_Sales_Price) %>%
  mutate(ds = yearweek(date)) %>%
  as_tsibble(index = ds)
 
 sales_train <- sales_tbl_ts %>%  # Train set
  filter(ds<ymd("2022-07-09"))

sales_test <- sales_tbl_ts %>% # Test set
  filter(ds>=ymd("2022-07-09"))

Data Source and Description

There are huge finances that are involved while buying a property. In order to make that investment in the right time and right property, it is very important to understand the factors that drives the price and more important what would be the forecast price in future for that property.One of biggest player in the real state industry is Zillow who deals in buying, selling and renting a home.

Data Source : Zillow Weekly Median Sales Price

We have median weekly sales data for the houses sold in San Francisco region from 1st week of Jan 2018 to 1st week of Jan 2023.

Code
sales_train %>% ggplot() +
  geom_line(aes(ds,y)) +
  xlab("Year Week") +
  ylab("Median Sales Price")+
  ggtitle("Line Plot of Median Sales Price")+
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple'))

We can see multiplicative seasonality from the plot.

  • Seasonality ( Up trend during summer months and downtrend during holiday period)
  • Mean value is increasing over the year

Summary Stats

Mean/Median

Code
summary(sales_train$y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 846500  922500  973972  967392  998724 1099750 

Standard Deviation

Code
sd(sales_train$y)
[1] 58959

Mean of sales price is $971991 and median is $977750 with standard deviation of $58875.3

Distribution of Timeseries

Code
hist <- sales_train %>%
  ggplot() +
  geom_histogram(aes(y),colour = 4, fill = "white", bins = 20) +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple'))


dens <- sales_train %>%
  ggplot() +
  geom_density(aes(y),colour = 4, fill = "white") +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple'))


violin <- sales_train %>%
  ggplot() +
  geom_violin(aes("", y),colour = 4, fill = "white") +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple')) 


boxplot <- sales_train %>%
  ggplot() +
  geom_boxplot(aes("", y),colour = 4, fill = "white") +
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple')) 


ggarrange(hist, dens, violin,boxplot , 
          labels = c("A", "B", "C","D"),
          ncol = 2, nrow = 2)

  • We don’t see any outlines in the data
  • Mean and Median are very close.
  • Data is evenly distributed and is normally distributed

Moving Average of Timeseries

Code
sales_ma_data <- sales_train %>%
  arrange(ds) %>%
  mutate(
    ma_13_left = rollapply(
      y,
      13,
      FUN = mean,
      align = "left", fill = NA
    ),
    ma_13_right = rollapply(
      y,
      13,
      FUN = mean,
      align = "right", fill = NA
    ),
    ma_13_center = rollapply(
      y,
      13,
      FUN = mean,
      align = "center", fill = NA
    )
  ) %>%
  mutate(
    value_ma_3 = rollapply(y, 3, FUN = mean, align = "center", fill = NA),
    value_ma_5 = rollapply(y, 5, FUN = mean, align = "center", fill = NA),
    value_ma_7 = rollapply(y, 7, FUN = mean, align = "center", fill = NA),
    value_ma_13 = rollapply(y, 13, FUN = mean, align = "center", fill = NA),
    value_ma_25 = rollapply(y, 25, FUN = mean, align = "center", fill = NA),
    value_ma_49 = rollapply(y, 49, FUN = mean, align = "center", fill = NA)
  )


sales_tbl_pivot <- sales_ma_data %>%
  pivot_longer(
    cols = ma_13_left:value_ma_49,
    values_to = "value_ma",
    names_to = "ma_order"
  ) %>%
  mutate(ma_order = factor(
    ma_order,
    levels = c(
      "ma_13_center",
      "ma_13_left",
      "ma_13_right",
      "value_ma_3",
      "value_ma_5",
      "value_ma_7",
      "value_ma_13",
      "value_ma_25",
      "value_ma_49"
    ),
    labels = c(
      "ma_13_center",
      "ma_13_left",
      "ma_13_right",
      "value_ma_3",
      "value_ma_5",
      "value_ma_7",
      "value_ma_13",
      "value_ma_25",
      "value_ma_49"
    )
  ))

x <- sales_tbl_pivot %>%
  filter(
    !ma_order %in% c(
      "ma_13_center",
      "ma_13_left",
      "ma_13_right",
      "value_ma_7",
      "value_ma_49"
    )
  ) %>%
  mutate(ma_order = case_when(
    ma_order=='value_ma_3'~'3rd Order',
    ma_order=='value_ma_5'~'5th Order',
    ma_order=='value_ma_13'~'13th Order',
    ma_order=='value_ma_25'~'25th Order')
  ) %>%
  mutate(
    ma_order = factor(
      ma_order,
      labels = c('3rd Order',
      '5th Order',
      '13th Order',
      '25th Order'),
      levels = c('3rd Order',
      '5th Order',
      '13th Order',
      '25th Order')
    )
  ) %>%
  ggplot() +
  geom_line(aes(ds, y), size = 1) +
  geom_line(aes(ds, value_ma, color = ma_order), size = 1) +
    scale_color_discrete(name = 'MA Order')+
    theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple'))+
  ylab('Median Sales Price') +
  xlab('Year Week')+
  ggtitle('Moving Average Graph')


x

We are trying to see the moving average based on the 3rd,5th,13th and 25th order. From the graph 13th order moving avg. explains the seasonality others will be either over fitting or under fitting.

STL Decomposition

Code
sales_train %>%
  model(
    STL(y)
  ) %>%
  components() %>%
  autoplot()+
   theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple'))+
  xlab('Year Week')+
  ggtitle('STL Decomposition Graphs')

As we have observed seasonality we are going ahead with STL decomposition. We are able to explain the seasonality but still we see a cyclic trend in the residual as it moves up and down.We won’t be able to certify it as a white noise.

Section 2

ARIMA Modeling

Rolling SD of Differenced Series

As we see from the above plot we can clearly see mean and SD variance. We will try to check the variance after taking the rolling difference.

Code
sales_diff <- sales_train %>%
  mutate(value_diff = y - lag(y)) %>%
  as_tsibble(index=ds)


sales_diff %>%
mutate(
    diff_sd = zoo::rollapply(
      y, 
      FUN = sd, 
      width = 12, 
      fill = NA)) %>%
ggplot()+
geom_line(aes(ds,diff_sd))+
geom_smooth(aes(ds,diff_sd),method='lm',se=F)+
 theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple'))+
ggtitle("Standard Deviation of Differenced Sales Price,  over Time") +
ylab("SD of Differenced Sales Price") +
xlab("Year Week")
`geom_smooth()` using formula = 'y ~ x'

After doing the rolling SD of differences series we can see negligible variance, based on the output we don’t need to do log transformation or BoxCox transformation.

Seasonal Differenced

As its a property sales price data and we have observed seasonality, in the next step we will check the seasonal difference. We are using a value difference of 1 year and taking 1 lag after that as seasonality was still present after 1 difference.

Code
sales_diff <- sales_diff %>% 
            mutate(value_seasonal = difference(y,54))

sales_diff <- sales_diff %>% 
            mutate(value_seasonal_1 = difference(value_seasonal,1))


sales_diff %>%
  gg_tsdisplay(value_seasonal_1,plot_type='partial', lag=36) +
  labs(title="Seasonally  differenced", y="")

KPSS test.

Code
seasonal_value_kpss = sales_diff %>% 
     features(value_seasonal_1, unitroot_kpss)
seasonal_value_kpss
# A tibble: 1 × 2
  kpss_stat kpss_pvalue
      <dbl>       <dbl>
1     0.138         0.1

The p-value is > 0.05, so we can say the seasonal value that we have derived is is stationary now.

ACF and PACF

ACF and PACF plots using the seasonal data.

Code
par(mfrow = c(1, 2))
acf(sales_diff$value_seasonal_1, na.action = na.pass)
pacf(sales_diff$value_seasonal_1, na.action = na.pass)

Based on the data from the lag plots and stationary we can assume ARIMA(2,1,0)(1,1,0)

BIC Model Comparision

Build and Compare Some Models by BIC, based on the below result set. Model 7 with ARIMA (2,1,2) is the best model. So our estimation and ARIMA predication are same.

Code
models_bic = sales_diff %>%
  
  model(
    mod1 = ARIMA(y~pdq(0,1,0)+PDQ(0,1,0)),
    mod2 = ARIMA(y~pdq(2,1,1)+PDQ(0,1,0)),
    mod3 = ARIMA(y~pdq(1,1,0)+PDQ(0,1,0)),
    mod4 = ARIMA(y~pdq(2,1,0)+PDQ(1,1,0)),
    mod5 = ARIMA(y~pdq(1,1,1)+PDQ(0,1,0)),
    mod6 = ARIMA(y~pdq(3,1,1)+PDQ(0,1,0)),
    mod7 = ARIMA(y~pdq(2,1,2)+PDQ(1,1,0))
  )

models_bic %>%
  glance() %>%
  arrange(BIC)
# A tibble: 7 × 8
  .model    sigma2 log_lik   AIC  AICc   BIC ar_roots   ma_roots 
  <chr>      <dbl>   <dbl> <dbl> <dbl> <dbl> <list>     <list>   
1 mod7   27188626.  -1803. 3619. 3619. 3638. <cpl [54]> <cpl [2]>
2 mod4   31127619.  -1817. 3642. 3642. 3654. <cpl [54]> <cpl [0]>
3 mod2   39906073.  -1831. 3669. 3669. 3682. <cpl [2]>  <cpl [1]>
4 mod6   40091658.  -1831. 3671. 3672. 3687. <cpl [3]>  <cpl [1]>
5 mod3   46851265.  -1844. 3693. 3693. 3699. <cpl [1]>  <cpl [0]>
6 mod5   46007411.  -1842. 3691. 3691. 3700. <cpl [1]>  <cpl [1]>
7 mod1   90917498.  -1904. 3811. 3811. 3814. <cpl [0]>  <cpl [0]>
Code
best_model <- models_bic %>% dplyr::select(mod7)

Section 3

Meta Prophet Model

Initial Prophet Model

Code
suppressMessages({
  model = prophet(sales_train)
  future = make_future_dataframe(model,periods = 27, freq = 'weeks')
  forecast = predict(model,future)

  plot(model,forecast)+
  ggtitle("Default Facebook Prophet Model")+
  ylab("SF Housing Sales Prices")+xlab("Date")+
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple'))

})

Interactive Visualization

Code
dyplot.prophet(model,forecast)

Prophet Decomposition

Code
prophet_plot_components(model,forecast)

The sales price has exhibited a steady upward trend since mid-2021, which is mainly attributed to the post-Covid-19 ease in the housing market. Additionally, there appears to be a seasonal pattern in the sales price, with prices rising during the summer months, specifically in June, and declining during the fall and end of the year.

Hyper Parameters

By adjusting the hyper parameters and trying multiple combinations we have got a realistic fitted line. We have adjusted changepoint.range = 0.90,n.changepoints=15,changepoint.prior.scale=0.1 parameters.

Code
suppressMessages({
model_new = prophet(sales_train,changepoint.range = 0.90,n.changepoints=15,changepoint.prior.scale=0.1)

forecast1 = predict(model_new,future)

plot(model_new,forecast1)+ add_changepoints_to_plot(model_new)+xlab("Year Week")+ylab("SF Housing Sales Prices")+  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple'))

})

Saturating Minimum/Maximum point

Defining the min and max point help the users help check the wild values in the plot and for will be very useful in stock price forecasting when the decision(sell/buy) need to be taken on price value. In our case its not very revelant.

Code
suppressMessages({
sixmnths_future_df = make_future_dataframe(model,periods = 27)
sixmnths_forecast_df = predict(model,sixmnths_future_df)

# Set "floor" in training data
sales_train$floor = 850000
sales_train$cap = 1500000
future$floor = 850000
future$cap = 1200000

# Set floor in forecsat data
sixmnths_future_df$floor = 850000
sixmnths_future_df$cap = 1200000
sat_model = prophet(sales_train,growth='logistic')

sat_six_mnths_forecast = predict(sat_model,sixmnths_future_df)

plot(sat_model,sat_six_mnths_forecast)+ylim(850000,1200000)+theme_bw()+xlab("Year Week")+ylab("Median Sales Price")
})

Seasonality Identification

Code
prophet_plot_components(model,forecast)

The sales price has exhibited a steady upward trend since mid-2021, which is mainly attributed to the post-Covid-19 ease in the housing market. Additionally, there appears to be a seasonal pattern in the sales price, with prices rising during the summer months, specifically in June, and declining during the fall and end of the year.

From the plot we can infer that it shows multiplicative seasonality.The amplitude of the seasonal fluctuations changes over time, the seasonality is multiplicative.

Holiday Impact

As its San Francisco sales data, lets check it by using the US holiday calender with the model.

Code
suppressMessages({
model = prophet(sales_train,fit=FALSE)
model = add_country_holidays(model,country_name = 'US')
model = fit.prophet(model,sales_train)
forecast2 = predict(model,future)
prophet_plot_components(model,forecast2)
})

We don’t see much holiday impact in the plot, couple of towers are present but its not something that is present in all years. We should not consider the holiday component for our time series.

Section 4

Model Comparison and Validation

In the model selection and validation we will compare the different models that we have created and will evaluate on similar metrics such as RMSE, MAPE etc. After the comparison we will select the best model and will run it on test data.

Cross-Validation

In-Sample Cross-Validation to Compare SNAIVE vs ARIMA vs Prophet Models

Based on the plot we can see that prophet model is close to the actual line. We can compare the accuracy by other metrics.

Code
cv_data <- sales_train %>%
  stretch_tsibble(.init = 54*3, .step = 27*1)


cv_forecast <- cv_data %>%
  model(
  naive_w_season = SNAIVE(y),
  arima_mod = ARIMA(y~pdq(2,1,2)+PDQ(1,1,0)),
  prophet_mod = fable.prophet::prophet(y ~ growth("linear", changepoint_range = 0.90,n_changepoints=15,changepoint_prior_scale=0.1) + season("year", type = "multiplicative")),
  ) %>%
  forecast(h = 26)


z <- cv_forecast %>%
    as_tsibble() %>%
    dplyr::select(-y) %>%
    left_join(
        sales_train, by ="ds"
    ) %>%
    ggplot()+
    geom_line(aes(ds,y))+
    geom_line(aes(ds,.mean,color=factor(.id),linetype=.model))+
    scale_color_discrete(name='Iteration')+
    ggtitle('SNAIVE vs ARIMA vs Prophet Forecast for Each CV Iteration')+
    theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple'))+
    ylab("Median Sales Price")+
    xlab("Year Week")  

z

Average Accuracy Comparison

Code
cv_forecast %>%
  accuracy(sales_train) %>%
  data.table()
           .model .type         ME     RMSE      MAE        MPE     MAPE
1:      arima_mod  Test   6112.045 42480.69 34184.34  0.5067585 3.349510
2: naive_w_season  Test -19964.464 64302.86 55980.49 -2.3080673 5.665352
3:    prophet_mod  Test  23190.711 49520.15 36109.11  2.1636915 3.509316
        MASE     RMSSE      ACF1
1: 0.6345882 0.6411619 0.9533924
2: 1.0392057 0.9705244 0.9593905
3: 0.6703191 0.7474086 0.9352478

arima_mod model has a RMSE score of 102430.18 on the Test data, which suggests that its predictions are off by an average of that much compared to the actual values in the test set. Similarly, the prophet_mod model has a MAPE score of 3.524022 on the Test data, which suggests that its predictions have an average absolute percentage error of that much compared to the actual values in the test set.

Below we can check the RMSE of all models in T+1, T+2…

RMSE

Code
cv_forecast %>%
  group_by(.id,.model) %>%
  mutate(h = row_number()) %>%
  ungroup() %>%
  as_fable(response = "y", distribution = y) %>%
  accuracy(sales_train, by = c("h", ".model")) %>%
  ggplot(aes(x = h, y = RMSE,color=.model)) +
  geom_point()+
  geom_line()+
  theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple'))+
  ggtitle('RMSE of Each Model at Different Intervals')+
  ylab('Average RMSE at Forecasting Intervals')+
  xlab('Year Week in the Future')

Based on the RMSE plotted above we can select ARIMA model as the best model for our test data. For the prophet model we have data gap that’s why the value has come down drastically.

We will go ahead with ARIMA pdq(2,1,2)+PDQ(1,1,0) as the best model.

Out of Sample Forecast

Code
best_model %>%
    forecast(h=26) %>%
    autoplot(sales_train %>%
    bind_rows(sales_test))+
    ylab('Median Sales Price')+
    xlab('Year Week')+
    theme(
    plot.title = element_text(size = 20, face = "bold"),
    plot.subtitle = element_text(size = 16),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    panel.background = element_rect(fill = 'lightblue', color ='purple'))

ARIMA model is working relatively well.

Final accuracy calculation based on the test set

Code
best_model %>%
    forecast(h=26) %>%
    accuracy(sales_test)
# A tibble: 1 × 10
  .model .type      ME   RMSE    MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr>   <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mod7   Test  -31466. 36513. 31466. -3.17  3.17   NaN   NaN 0.794

The best selected model, i.e., ARIMA pdq(2,1,2)+PDQ(1,1,0) model performed with 3.16% average deviation from real values during out of sample testing.